CBG-3271 allow in memory buckets to persist #16

Merged · 5 commits · Nov 16, 2023
Conversation

@torcolvin (Collaborator) commented Nov 8, 2023

  • in memory buckets will now live for the lifecycle of the program until Bucket.CloseAndDelete is called. This facilitates bucket closing without removing the data, which is a common use in Sync Gateway tests that use persistent config to update database configuration.
  • When a bucket is first created, a copy is stored in the bucket registry, and this is never closed until:
    • in memory: CloseAndDelete is called
    • on disk: Close is called to bring refcount of buckets to 0
  • force the bucket name to be defined, and do not let multiple copies of a persistent bucket be opened if they have different paths. Note, this is a blunt instrument, and it is indiscriminate about path manipulations: it matches paths lexicographically, not as normalized paths. The idea is to be safer. Sync Gateway is not architected to support multiple buckets with the same name that do not have the same backing data.
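The registry behavior described above can be sketched as follows. This is a minimal illustration, not rosmar's actual code; the type and method names (`bucketRegistry`, `open`, `closeAndDelete`, and the `buckets`/`bucketCount` fields) are assumptions based on the discussion below.

```go
package main

import (
	"fmt"
	"sync"
)

// Bucket is a stand-in for rosmar's Bucket type.
type Bucket struct {
	name     string
	inMemory bool
}

// bucketRegistry keeps one canonical *Bucket per name plus a refcount,
// so closing a copy never tears down shared state prematurely.
type bucketRegistry struct {
	mu          sync.Mutex
	buckets     map[string]*Bucket // canonical copy per bucket name
	bucketCount map[string]int     // refcount of open copies per name
}

func (r *bucketRegistry) open(name string, inMemory bool) *Bucket {
	r.mu.Lock()
	defer r.mu.Unlock()
	b, ok := r.buckets[name]
	if !ok {
		b = &Bucket{name: name, inMemory: inMemory}
		r.buckets[name] = b
	}
	r.bucketCount[name]++
	return b
}

// close drops a reference. An on-disk bucket leaves the registry when its
// refcount hits zero; an in-memory bucket stays until closeAndDelete.
func (r *bucketRegistry) close(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.bucketCount[name]--
	if r.bucketCount[name] <= 0 && !r.buckets[name].inMemory {
		delete(r.buckets, name)
		delete(r.bucketCount, name)
	}
}

// closeAndDelete removes the bucket unconditionally, deleting the data
// of an in-memory bucket.
func (r *bucketRegistry) closeAndDelete(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.buckets, name)
	delete(r.bucketCount, name)
}

func main() {
	r := &bucketRegistry{
		buckets:     map[string]*Bucket{},
		bucketCount: map[string]int{},
	}
	b1 := r.open("b", true)
	b2 := r.open("b", true)
	fmt.Println(b1 == b2) // both opens return the same canonical bucket
	r.close("b")
	r.close("b")
	fmt.Println(len(r.buckets)) // in-memory data survives Close
	r.closeAndDelete("b")
	fmt.Println(len(r.buckets)) // gone only after CloseAndDelete
}
```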

Implementation:

The global state of a bucket consists of two things:

  • *sql.DB representing the underlying sqlite connection
  • dcpFeeds for each connection that exists

The sql.DB connection can be opened multiple times on the same path, and this was the original implementation of rosmar. However, it can't be opened multiple times for in-memory databases except via the cache=shared query parameter to sqlite3_open. This ended up causing behavior that I didn't understand, and it is not typically supported in sqlite, since multiple connections are only coordinated via a WAL when the database is on disk.
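For reference, a named in-memory SQLite database is shared between connections only when the DSN carries cache=shared; a plain mode=memory DSN gives each connection its own private database. The helper below only builds the two DSN shapes (the function name is hypothetical; no driver is opened here):

```go
package main

import (
	"fmt"
	"net/url"
)

// inMemoryDSN builds a sqlite URI filename for a named in-memory
// database. With shared=true the cache=shared parameter is added, which
// is what lets a second connection see the same data.
func inMemoryDSN(name string, shared bool) string {
	q := url.Values{"mode": {"memory"}}
	if shared {
		q.Set("cache", "shared")
	}
	return fmt.Sprintf("file:%s?%s", name, q.Encode())
}

func main() {
	fmt.Println(inMemoryDSN("bucket1", true))  // shared between connections
	fmt.Println(inMemoryDSN("bucket1", false)) // private per connection
}
```

This PR sidesteps the cache=shared behavior entirely by keeping a single canonical *sql.DB in the registry instead of opening the in-memory database more than once.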

DCP feeds in rosmar work by pushing events on a CUD operation to a queue, which can be read by any running feeds. Instead of having separate feeds for each copy of Bucket and publishing them via bucketsAtUrl, we now only have a single canonical set of bucket feeds. This field moved from Collection to Bucket. This addresses https://issues.couchbase.com/browse/CBG-3540
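The single-queue idea can be sketched like this (all names here are illustrative, not rosmar's actual types): every copy of a bucket publishes CUD events to one shared set of feed channels held on the canonical Bucket, rather than each copy owning its own feeds.

```go
package main

import "fmt"

// event is a stand-in for a DCP mutation/deletion event.
type event struct{ key string }

// bucket holds the single canonical set of feeds; every copy of the
// bucket shares a pointer to this state, so all feeds see every event.
type bucket struct {
	feeds []chan event // one entry per running DCP feed
}

// publish fans a CUD event out to every running feed.
func (b *bucket) publish(e event) {
	for _, f := range b.feeds {
		f <- e
	}
}

func main() {
	b := &bucket{}
	feed := make(chan event, 8)
	b.feeds = append(b.feeds, feed)
	b.publish(event{key: "doc1"})
	fmt.Println((<-feed).key)
}
```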

Whether a bucket is open or closed is controlled by Bucket._db(), which is called by all CRUD operations. A Collection has a pointer to its parent bucket. Each newly opened Bucket creates its Collections dynamically, but these share pointers to the cached in-memory versions.
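A minimal sketch of that gate, assuming simplified types (the real Bucket and Collection carry much more state, and the `get` method here is hypothetical):

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// Bucket owns the shared sqlite handle and the per-copy closed flag.
type Bucket struct {
	closed bool
	db     *sql.DB
}

// _db is the single gate every CRUD path goes through: once this copy of
// the bucket is closed, all operations fail fast.
func (b *Bucket) _db() (*sql.DB, error) {
	if b.closed {
		return nil, errors.New("bucket has been closed")
	}
	return b.db, nil
}

// Collection holds only a pointer back to its parent Bucket.
type Collection struct {
	bucket *Bucket
}

func (c *Collection) get(key string) error {
	_, err := c.bucket._db() // open/closed check happens here
	if err != nil {
		return err
	}
	// ... run the actual SQL query here ...
	return nil
}

func main() {
	b := &Bucket{}
	c := &Collection{bucket: b}
	fmt.Println(c.get("doc1")) // succeeds while open
	b.closed = true
	fmt.Println(c.get("doc1")) // errors once closed
}
```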

Depends on couchbase/sg-bucket#110

}
// Copy the slice before mutating, in case a client is iterating it:
Contributor:

Whereabouts do we iterate over the slice without holding the lock?

Collaborator (Author):

This code was carried over from the original implementation; I've simplified it for refcounting.

bucket_api.go Outdated
bucket.closed = true
}

func (bucket *Bucket) _closeAllInstances() {
Contributor:

What are the 'AllInstances' here, since it looks like it's just closing this single bucket's connection?

Collaborator (Author):

Renamed this closeSQliteDB, although it really closes more than that: if this function is called on any bucket, it will shut down all Buckets with a matching name. Any suggestion of a better name would be welcome.

collections: make(map[sgbucket.DataStoreNameImpl]*Collection),
collectionFeeds: make(map[sgbucket.DataStoreNameImpl][]*dcpFeed),
mutex: &sync.Mutex{},
nextExp: new(uint32),
Contributor:

I'm not familiar with this usage - what's the advantage of setting new(uint32) instead of leaving nil?

Collaborator (Author):

This is just initializing an int pointer to 0; is there a more canonical way without writing a function?
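To illustrate the point under discussion: `new(uint32)` is idiomatic Go for a zero-valued pointer field, since it allocates the uint32, zeroes it, and returns a pointer that is immediately safe to dereference, whereas a nil pointer would panic on first use.

```go
package main

import "fmt"

func main() {
	// new(uint32) allocates a zeroed uint32 and returns its address,
	// so the field can be dereferenced (or passed to atomic ops)
	// right away.
	p := new(uint32)
	fmt.Println(*p) // 0

	// A declared-but-unset pointer is nil; dereferencing it would panic.
	var q *uint32
	fmt.Println(q == nil) // true
}
```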

var bucketRegistry = map[string][]*Bucket{} // Maps URL to slice of Buckets at that URL
var bucketRegistryMutex sync.Mutex // Thread-safe access to bucketRegistry
type bucketRegistry struct {
byURL map[string][]*Bucket
Contributor:

Does this need to be a []*Bucket, or is it just used for ref counting and so could be something like []bool or []struct{}?

Collaborator (Author):

I replaced this with integer refcounting.

var bucketRegistryMutex sync.Mutex // Thread-safe access to bucketRegistry
type bucketRegistry struct {
byURL map[string][]*Bucket
inMemoryRef map[string]*Bucket
Contributor:

As discussed, the names are a bit confusing if 'byURL' is actually the ref count, and 'inMemoryRef' is the lookup of bucket by URL.

Collaborator (Author):

I picked the names bucketCount for the refcount and buckets for the actual cached copy of the bucket.

- create a copy of the bucket in getCachedBucket so caller doesn't have
  to remember to make a copy
@adamcfraser (Contributor) left a comment:

Looks good to me. One suggestion for additional documentation on the bucket registry as discussed, but otherwise looks good.

// `realpath` or `fcntl(F_GETPATH)`) and use the canonical path as the key.
// Unfortunately Go doesn't seem to have an API for that.
// * In memory bucket: the bucket is not deleted until CloseAndDelete is called on any Bucket.
// * On disk bucket: the bucket is deleted from the registry when there are no open copies in memory. Unlike an in-memory bucket, the bucket stays persisted on disk and can be reopened.
Contributor:

As discussed, a bit more context on how buckets are managed here will make this easier to maintain. In particular I'm thinking of explicitly documenting:

  • what's shared between bucket copies (almost everything)
  • what's not shared, and why (in particular 'closed')

@torcolvin torcolvin merged commit adb4806 into main Nov 16, 2023
12 checks passed
@torcolvin torcolvin deleted the CBG-3271 branch November 16, 2023 23:23